Maximum Inner Product Search (MIPS) is a popular problem in the machine learning literature due to its applicability in a wide array of applications, such as recommender systems. In high-dimensional settings, however, MIPS queries can become computationally expensive as most existing solutions do not scale well with data dimensionality. In this work, we present a state-of-the-art algorithm for the MIPS problem in high dimensions, dubbed BanditMIPS. BanditMIPS is a randomized algorithm that borrows techniques from multi-armed bandits to reduce the MIPS problem to a best-arm identification problem. BanditMIPS reduces the complexity of state-of-the-art algorithms from $O(\sqrt{d})$ to $O(\text{log}d)$, where $d$ is the dimension of the problem data vectors. On high-dimensional real-world datasets, BanditMIPS runs approximately 12 times faster than existing approaches and returns the same solution. BanditMIPS requires no preprocessing of the data and includes a hyperparameter that practitioners may use to trade off accuracy and runtime. We also propose a variant of our algorithm, named BanditMIPS-$\alpha$, which employs non-uniform sampling across the data dimensions to provide further speedups.
translated by 谷歌翻译
Random forests are some of the most widely used machine learning models today, especially in domains that necessitate interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, used to efficiently find split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (an 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show our algorithm can be used in both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget. All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest.
translated by 谷歌翻译
当代编码教育往往为学生提供开发具有用户交互和复杂动态系统的计划的任务,例如基于鼠标的游戏。在教学上引人注目的同时,没有现代的自主方法来提供反馈。值得注意的是,通过传统的单元测试,互动计划不可能等级。在本文中,我们正规化为互动计划提供反馈作为分类马尔可夫决策过程(MDP)的任务的挑战。每个学生的程序都完全指定了一个MDP,其中代理需要在合理的概括下运行和决定,如果输入MDP的动态和奖励模型应该被分类为正确或损坏。我们证明,通过在代理和自回归模型之间设计合作目标,我们可以使用代理从输入MDP采样差分轨迹,允许分类器确定成员资格:播放到等级。我们的方法使自动反馈系统能够进行交互式代码分配。我们将711,274个匿名学生提交的数据集发布到单个分配的单个分配,以支持未来的研究。
translated by 谷歌翻译
项目反应理论(IRT)是一个无处不在的模型,可以根据他们对问题的回答理解人类行为和态度。大型现代数据集为捕捉人类行为的更多细微差别提供了机会,从而有可能改善心理测量模型,从而改善科学理解和公共政策。但是,尽管较大的数据集允许采用更灵活的方法,但许多用于拟合IRT模型的当代算法也可能具有禁止现实世界应用的巨大计算需求。为了解决这种瓶颈,我们引入了IRT的变异贝叶斯推理算法,并表明它在不牺牲准确性的情况下快速可扩展。将此方法应用于认知科学和教育的五个大规模项目响应数据集中,比替代推理算法更高的对数可能性和更高的准确性。然后,使用这种新的推论方法,我们将IRT概括为具有表现力的贝叶斯响应模型,利用深度学习的最新进展来捕获具有神经网络的非线性项目特征曲线(ICC)。使用TIMSS的特定级数学测试,我们显示我们的非线性IRT模型可以捕获有趣的不对称ICC。该算法实现是开源的,易于使用。
translated by 谷歌翻译
AI正在经历范式转变,随着模型的兴起(例如Bert,Dall-E,GPT-3),这些模型经过大规模的数据训练,并且可以适应广泛的下游任务。我们称这些模型基础模型来强调其至关重要但不完整的特征。该报告提供了基础模型的机会和风险的详尽说明,包括其功能(例如语言,愿景,机器人技术,推理,人类互动)和技术原则(例如,模型架构,培训程序,数据,系统,安全,安全性,评估,理论)对其应用(例如法律,医疗保健,教育)和社会影响(例如不平等,滥用,经济和环境影响,法律和道德考虑)。尽管基础模型基于标准的深度学习和转移学习,但它们的规模导致了新的新兴能力,以及它们在许多任务中的有效性都激发了同质化。同质化提供了强大的杠杆作用,但要求谨慎,因为基础模型的缺陷均由下游的所有适应模型继承。尽管即将广泛地部署基础模型,但我们目前对它们的工作方式,失败以及由于其新兴属性的影响而缺乏清晰的了解。为了解决这些问题,我们认为基础模型的许多批判性研究都需要与他们的基本社会技术性质相称。
translated by 谷歌翻译
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
translated by 谷歌翻译
Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
translated by 谷歌翻译
Traditionally, data analysis and theory have been viewed as separate disciplines, each feeding into fundamentally different types of models. Modern deep learning technology is beginning to unify these two disciplines and will produce a new class of predictively powerful space weather models that combine the physical insights gained by data and theory. We call on NASA to invest in the research and infrastructure necessary for the heliophysics' community to take advantage of these advances.
translated by 谷歌翻译
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
translated by 谷歌翻译
As text generated by large language models proliferates, it becomes vital to understand how humans engage with such text, and whether or not they are able to detect when the text they are reading did not originate with a human writer. Prior work on human detection of generated text focuses on the case where an entire passage is either human-written or machine-generated. In this paper, we study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models. We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time. Furthermore, we conduct a detailed comparison study and analyze how a variety of variables (model size, decoding strategy, fine-tuning, prompt genre, etc.) affect human detection performance. Finally, we collect error annotations from our participants and use them to show that certain textual genres influence models to make different types of errors and that certain sentence-level features correlate highly with annotator selection. We release the RoFT dataset: a collection of over 21,000 human annotations paired with error classifications to encourage future work in human detection and evaluation of generated text.
translated by 谷歌翻译